========================================================
This analysis covers arabica coffee, which accounts for about 60% of the world’s coffee production. The dataset contains quality measures for individual coffee samples.
The data were gathered from the Coffee Quality Institute (CQI) in January 2018. Data website: https://www.kaggle.com/volpatto/coffee-quality-database-from-cqi
https://en.wikipedia.org/wiki/Coffea_arabica https://blog.hunterlab.com/blog/color-food-industry/spectrophotometric-color-evaluation-of-green-coffee-beans-for-optimal-quality-and-consistency/ https://www.coffeechemistry.com/quality/cupping/cupping-fundamentals https://www.thirdwavecoffeeroasters.com/blogs/blog/what-is-speciality-coffee-and-how-is-it-graded
The overall goal is to review the dataset, see whether I can discover some commonalities among low- or high-ranking coffee beans, and find correlations that point to the right conditions for delicious coffee!
There are 1311 rows of data and 44 columns
## [1] 1311 44
## 'data.frame': 1311 obs. of 44 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Species : chr "Arabica" "Arabica" "Arabica" "Arabica" ...
## $ Owner : chr "metad plc" "metad plc" "grounds for health admin" "yidnekachew dabessa" ...
## $ Country.of.Origin : chr "Ethiopia" "Ethiopia" "Guatemala" "Ethiopia" ...
## $ Farm.Name : chr "metad plc" "metad plc" "san marcos barrancas \"san cristobal cuch" "yidnekachew dabessa coffee plantation" ...
## $ Lot.Number : chr "" "" "" "" ...
## $ Mill : chr "metad plc" "metad plc" "" "wolensu" ...
## $ ICO.Number : chr "2014/2015" "2014/2015" "" "" ...
## $ Company : chr "metad agricultural developmet plc" "metad agricultural developmet plc" "" "yidnekachew debessa coffee plantation" ...
## $ Altitude : chr "1950-2200" "1950-2200" "1600 - 1800 m" "1800-2200" ...
## $ Region : chr "guji-hambela" "guji-hambela" "" "oromia" ...
## $ Producer : chr "METAD PLC" "METAD PLC" "" "Yidnekachew Dabessa Coffee Plantation" ...
## $ Number.of.Bags : int 300 300 5 320 300 100 100 300 300 50 ...
## $ Bag.Weight : chr "60 kg" "60 kg" "1" "60 kg" ...
## $ In.Country.Partner : chr "METAD Agricultural Development plc" "METAD Agricultural Development plc" "Specialty Coffee Association" "METAD Agricultural Development plc" ...
## $ Harvest.Year : chr "2014" "2014" "" "2014" ...
## $ Grading.Date : chr "April 4th, 2015" "April 4th, 2015" "May 31st, 2010" "March 26th, 2015" ...
## $ Owner.1 : chr "metad plc" "metad plc" "Grounds for Health Admin" "Yidnekachew Dabessa" ...
## $ Variety : chr "" "Other" "Bourbon" "" ...
## $ Processing.Method : chr "Washed / Wet" "Washed / Wet" "" "Natural / Dry" ...
## $ Aroma : num 8.67 8.75 8.42 8.17 8.25 8.58 8.42 8.25 8.67 8.08 ...
## $ Flavor : num 8.83 8.67 8.5 8.58 8.5 8.42 8.5 8.33 8.67 8.58 ...
## $ Aftertaste : num 8.67 8.5 8.42 8.42 8.25 8.42 8.33 8.5 8.58 8.5 ...
## $ Acidity : num 8.75 8.58 8.42 8.42 8.5 8.5 8.5 8.42 8.42 8.5 ...
## $ Body : num 8.5 8.42 8.33 8.5 8.42 8.25 8.25 8.33 8.33 7.67 ...
## $ Balance : num 8.42 8.42 8.42 8.25 8.33 8.33 8.25 8.5 8.42 8.42 ...
## $ Uniformity : num 10 10 10 10 10 10 10 10 9.33 10 ...
## $ Clean.Cup : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Sweetness : num 10 10 10 10 10 10 10 9.33 9.33 10 ...
## $ Cupper.Points : num 8.75 8.58 9.25 8.67 8.58 8.33 8.5 9 8.67 8.5 ...
## $ Total.Cup.Points : num 90.6 89.9 89.8 89 88.8 ...
## $ Moisture : num 0.12 0.12 0 0.11 0.12 0.11 0.11 0.03 0.03 0.1 ...
## $ Category.One.Defects : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Quakers : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Color : chr "Green" "Green" "" "Green" ...
## $ Category.Two.Defects : int 0 1 0 2 2 1 0 0 0 4 ...
## $ Expiration : chr "April 3rd, 2016" "April 3rd, 2016" "May 31st, 2011" "March 25th, 2016" ...
## $ Certification.Body : chr "METAD Agricultural Development plc" "METAD Agricultural Development plc" "Specialty Coffee Association" "METAD Agricultural Development plc" ...
## $ Certification.Address: chr "309fcf77415a3661ae83e027f7e5f05dad786e44" "309fcf77415a3661ae83e027f7e5f05dad786e44" "36d0d00a3724338ba7937c52a378d085f2172daa" "309fcf77415a3661ae83e027f7e5f05dad786e44" ...
## $ Certification.Contact: chr "19fef5a731de2db57d16da10287413f5f99bc2dd" "19fef5a731de2db57d16da10287413f5f99bc2dd" "0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660" "19fef5a731de2db57d16da10287413f5f99bc2dd" ...
## $ unit_of_measurement : chr "m" "m" "m" "m" ...
## $ altitude_low_meters : num 1950 1950 1600 1800 1950 ...
## $ altitude_high_meters : num 2200 2200 1800 2200 2200 NA NA 1700 1700 1850 ...
## $ altitude_mean_meters : num 2075 2075 1700 2000 2075 ...
Changing Data to Eliminate Certain Columns
There are certain columns that are not needed in my analysis, either because they have too many empty values, because they repeat the same data, or because they don’t really tell me much. So I will remove those columns from my dataset to narrow things down.
Now it’s 24 columns.
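As a minimal sketch of how columns can be dropped by name in base R: the toy data frame below is made up for illustration, and the three dropped names are just examples of the kinds of columns I removed, not the full list.

```r
# Toy stand-in for the real data (values are illustrative only).
toy <- data.frame(
  Species = "Arabica",
  Bag.Weight = "60 kg",
  Certification.Contact = "19fef5a7",
  Aroma = 8.67,
  stringsAsFactors = FALSE
)

# Columns to remove: constant, irrelevant, or mostly empty.
drop_cols <- c("Species", "Bag.Weight", "Certification.Contact")

# Keep every column whose name is not in drop_cols.
toy <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy))
```

Using `drop = FALSE` keeps the result a data frame even when only one column survives.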
In this section I will be performing preliminary exploration of the dataset. I will run some summaries of the data, clean the data, and create some plots to understand the structure of my variables.
## [1] FALSE
I am interested in the farms, countries, and harvest year, but before I try to plot that, I am curious as to the number of unique values.
There are 558 unique farm names, 37 unique countries, 47 unique harvest years, and 557 unique expiration dates.
## [1] 558
## [1] 37
## [1] 47
## [1] 557
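Counts like these can be produced in one pass. This is a small sketch on made-up data (the `toy` frame is hypothetical); note that a blank string counts as its own unique value until it is turned into NA.

```r
# Toy stand-in for the real data (values are illustrative only).
toy <- data.frame(
  Farm.Name = c("metad plc", "metad plc", "wolensu", ""),
  Country.of.Origin = c("Ethiopia", "Ethiopia", "Guatemala", "Ethiopia"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column ("" is counted as a value).
unique_counts <- sapply(toy, function(col) length(unique(col)))
print(unique_counts)
```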
It’s time to do some cleaning. I’m noticing a lot of inconsistent data in Harvest.Year that won’t be good for plotting. Expiration has dates stored as strings, and I’d really like the years as numbers. In addition, I’d like to have NAs rather than blanks so I can filter them out later.
Because Harvest.Year sometimes holds a range of years, like 2016-2017, and does so inconsistently, I’d like to keep only the first year of the harvest. So I’m going to rename the column to Harvest.Year.Begin.
The data is so inconsistent that it’s hard to find an automated formula that would save me time. Since the dataset is fairly small, I decided to change all the values manually.
##
## 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
## 2 20 30 36 352 199 245 153 129 87 1
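I changed these values by hand, but the easy cases could probably be handled with a regular expression. The sketch below is not the method I used; the `raw_years` strings are made-up examples of messy input, and anything without a clear four-digit year is left as NA for manual review.

```r
# Made-up examples of messy harvest-year strings.
raw_years <- c("2014", "2016 / 2017", "2013/2014", "4T/10", "", "Mayo a Julio")

# Find the first four-digit year (19xx or 20xx) in each string.
m <- regexpr("(19|20)[0-9]{2}", raw_years)

# Start with NA everywhere, then fill in the strings that matched.
harvest_begin <- rep(NA_character_, length(raw_years))
harvest_begin[m != -1] <- regmatches(raw_years, m)

# Convert to numbers; unmatched entries stay NA.
harvest_begin_num <- as.numeric(harvest_begin)
print(harvest_begin_num)
```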
For Expiration, I’d like to keep just the year rather than the month, day and year in a string format. Since this data was more consistent, it was easier to clean.
##
## 018\n 2011 2012 2013 2014 2015 2016 2017 2018 2019
## 1 50 83 314 153 246 195 123 140 6
## Sample.ID Country.of.Origin Farm.Name Mill
## 962 962 Brazil fazendas klem ltda dry mill
## Region Harvest.Year.Begin Variety Processing.Method Aroma
## 962 matas de minas <NA> Catuai Natural / Dry 7.5
## Flavor Aftertaste Acidity Body Balance Uniformity Clean.Cup Sweetness
## 962 7.58 7.5 7.5 7.75 7.83 9.33 9.33 9.33
## Total.Cup.Points Moisture Category.One.Defects Color Category.Two.Defects
## 962 81.33 0.11 0 Green 0
## Expiration altitude_mean_meters
## 962 018\n 1100
Now the year fields are converted to numeric fields. The one malformed Expiration value above ("018\n", row 962) ends up as NA.
## num [1:1311] 2014 2014 NA 2014 2014 ...
## num [1:1311] 2016 2016 2011 2016 2016 ...
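Here is a rough sketch of pulling the year out of date strings like the Expiration values. The pattern is my own assumption about the format ("Month day, year"); anything that doesn’t fit, such as the malformed value seen earlier, is set to NA instead of being guessed.

```r
# Sample strings in the same shape as the Expiration column,
# plus the one malformed value from row 962.
expiration_raw <- c("April 3rd, 2016", "May 31st, 2011", "018\n")

# Expect text, a comma, then a four-digit year at the end.
pattern <- "^.*, *([0-9]{4})$"

# Keep the year where the pattern fits; otherwise mark as NA.
expiration_chr <- ifelse(grepl(pattern, expiration_raw),
                         sub(pattern, "\\1", expiration_raw),
                         NA_character_)

expiration_year <- as.numeric(expiration_chr)
print(expiration_year)
```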
Information about dataset
The dataset has 1311 observations.
Here is an overall summary of the data.
What stands out is that the Harvest Year for this dataset starts at 2008 and ends at 2018. Also, most of the data is filled in, and the NAs are limited to the timeframe variables and mean altitude.
## Sample.ID Country.of.Origin Farm.Name Mill
## Min. : 1.0 Length:1311 Length:1311 Length:1311
## 1st Qu.: 328.5 Class :character Class :character Class :character
## Median : 656.0 Mode :character Mode :character Mode :character
## Mean : 656.0
## 3rd Qu.: 983.5
## Max. :1312.0
##
## Region Harvest.Year.Begin Variety Processing.Method
## Length:1311 Min. :2008 Length:1311 Length:1311
## Class :character 1st Qu.:2012 Class :character Class :character
## Mode :character Median :2013 Mode :character Mode :character
## Mean :2014
## 3rd Qu.:2015
## Max. :2018
## NA's :57
## Aroma Flavor Aftertaste Acidity
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:7.420 1st Qu.:7.330 1st Qu.:7.250 1st Qu.:7.330
## Median :7.580 Median :7.580 Median :7.420 Median :7.500
## Mean :7.564 Mean :7.518 Mean :7.398 Mean :7.533
## 3rd Qu.:7.750 3rd Qu.:7.750 3rd Qu.:7.580 3rd Qu.:7.750
## Max. :8.750 Max. :8.830 Max. :8.670 Max. :8.750
##
## Body Balance Uniformity Clean.Cup
## Min. :0.000 Min. :0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:7.330 1st Qu.:7.330 1st Qu.:10.000 1st Qu.:10.000
## Median :7.500 Median :7.500 Median :10.000 Median :10.000
## Mean :7.518 Mean :7.518 Mean : 9.833 Mean : 9.833
## 3rd Qu.:7.670 3rd Qu.:7.750 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :8.580 Max. :8.750 Max. :10.000 Max. :10.000
##
## Sweetness Total.Cup.Points Moisture Category.One.Defects
## Min. : 0.000 Min. : 0.00 Min. :0.00000 Min. : 0.0000
## 1st Qu.:10.000 1st Qu.:81.17 1st Qu.:0.09000 1st Qu.: 0.0000
## Median :10.000 Median :82.50 Median :0.11000 Median : 0.0000
## Mean : 9.903 Mean :82.12 Mean :0.08886 Mean : 0.4264
## 3rd Qu.:10.000 3rd Qu.:83.67 3rd Qu.:0.12000 3rd Qu.: 0.0000
## Max. :10.000 Max. :90.58 Max. :0.28000 Max. :31.0000
##
## Color Category.Two.Defects Expiration altitude_mean_meters
## Length:1311 Min. : 0.000 Min. :2011 Min. : 1
## Class :character 1st Qu.: 0.000 1st Qu.:2013 1st Qu.: 1100
## Mode :character Median : 2.000 Median :2015 Median : 1311
## Mean : 3.592 Mean :2015 Mean : 1784
## 3rd Qu.: 4.000 3rd Qu.:2016 3rd Qu.: 1600
## Max. :55.000 Max. :2019 Max. :190164
## NA's :1 NA's :227
Let’s look at where the coffee comes from.
More of the Arabica samples in this dataset come from Mexico than from any other country: 236 samples, or 18% of the coffee.
##
## Brazil Burundi
## 132 2
## China Colombia
## 16 183
## Costa Rica Cote d?Ivoire
## 51 1
## Ecuador El Salvador
## 1 21
## Ethiopia Guatemala
## 44 181
## Haiti Honduras
## 6 53
## India Indonesia
## 1 20
## Japan Kenya
## 1 25
## Laos Malawi
## 3 11
## Mauritius Mexico
## 1 236
## Myanmar Nicaragua
## 8 26
## Panama Papua New Guinea
## 4 1
## Peru Philippines
## 10 5
## Rwanda Taiwan
## 1 75
## Tanzania, United Republic Of Thailand
## 40 32
## Uganda United States
## 26 8
## United States (Hawaii) United States (Puerto Rico)
## 73 4
## Vietnam Zambia
## 7 1
## [1] "18%"
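For reference, the 18% figure is just the country count divided by the number of rows. A small sketch using the three largest counts from the table above:

```r
# Sample counts taken from the country table above.
country_counts <- c(Mexico = 236, Colombia = 183, Guatemala = 181)
total_samples <- 1311

# Share of all samples, formatted as a whole-number percentage.
share <- country_counts / total_samples
pct_label <- paste0(round(100 * share), "%")
names(pct_label) <- names(share)
print(pct_label)
```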
Looking at Harvest Year from 2008 - 2018:
The largest number of coffee samples came from the 2012 harvest (352). The outliers are 2008 and 2018, with only 2 and 1 samples. I’m assuming that’s because data collection started and ended partway through those years.
##
## 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
## 2 20 30 36 352 199 245 153 129 87 1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2008 2012 2013 2014 2015 2018 57
Looking at Expiration Dates - Most coffee is set to expire in 2013:
After faceting on harvest year, and knowing that 2012 had the largest share of the harvest, it appears coffee expires about 1 year after harvest, which would explain why 2013 is the most frequent expiration year.
The summary below seems to support that assumption of expiration dates.
## coffeedata$Harvest.Year.Begin: 2008
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2011 2011 2011 2011 2011 2011
## ------------------------------------------------------------
## coffeedata$Harvest.Year.Begin: 2009
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2011 2011 2011 2011 2011 2012
## ------------------------------------------------------------
## coffeedata$Harvest.Year.Begin: 2010
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2011 2011 2012 2012 2012 2012
## ------------------------------------------------------------
## coffeedata$Harvest.Year.Begin: 2011
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2012 2012 2012 2012 2012 2013
## ------------------------------------------------------------
## coffeedata$Harvest.Year.Begin: 2012
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2013 2013 2013 2013 2013 2015
## ------------------------------------------------------------
## coffeedata$Harvest.Year.Begin: 2013
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2014 2014 2014 2014 2015 2015
## ------------------------------------------------------------
## coffeedata$Harvest.Year.Begin: 2014
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2015 2015 2015 2015 2016 2017
## ------------------------------------------------------------
## coffeedata$Harvest.Year.Begin: 2015
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2016 2016 2016 2016 2017 2018
## ------------------------------------------------------------
## coffeedata$Harvest.Year.Begin: 2016
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2016 2017 2017 2017 2018 2018
## ------------------------------------------------------------
## coffeedata$Harvest.Year.Begin: 2017
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2018 2018 2018 2018 2018 2019
## ------------------------------------------------------------
## coffeedata$Harvest.Year.Begin: 2018
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2019 2019 2019 2019 2019 2019
Looking at altitude averages.
There was a very big outlier in altitude (a maximum of 190164 meters), which makes me think it was a mistake. With that in mind, I scaled down my plot. Given the strange outlier, the median is a more reliable measure of typical altitude than the mean.
The median altitude at which coffee was grown was 1311 meters. That’s an elevation of about 4,301 feet.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 1100 1311 1784 1600 190164 227
## [1] 4301.181
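The feet figure comes from a straight unit conversion. A minimal sketch, where `meters_to_feet` is a helper name I made up for illustration:

```r
# 1 meter is approximately 3.28084 feet.
meters_to_feet <- function(m) m * 3.28084

median_altitude_m <- 1311
median_altitude_ft <- meters_to_feet(median_altitude_m)
print(median_altitude_ft)
```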
Beans are inspected for defects. Category one defects would be things like rotten beans. Category two defects are things like broken beans.
I noticed there were a lot more category two defects. This makes sense because category one defects should be zero and category two defects should be fewer than 5.
I created another column called Quality.Standards. This new variable is TRUE if the beans have zero category one defects and fewer than 5 category two defects.
##
## FALSE TRUE
## 418 893
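The rule behind Quality.Standards can be written as a single logical expression. This is a sketch on made-up defect counts, not the real data:

```r
# Toy defect counts (illustrative only).
toy <- data.frame(
  Category.One.Defects = c(0, 0, 1, 0),
  Category.Two.Defects = c(0, 4, 2, 5)
)

# TRUE only when there are no category one defects
# and fewer than 5 category two defects.
toy$Quality.Standards <- toy$Category.One.Defects == 0 &
  toy$Category.Two.Defects < 5

print(table(toy$Quality.Standards))
```

In the toy data, the third row fails on category one and the fourth fails on category two, so two rows pass and two do not.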
Here is a plot that visually shows the result of the new column.
There were 1311 rows of data and 44 columns initially. However, many of those columns were eliminated.
dim(coffeedata)
## [1] 1311 25
Quality Measure Meanings
10 best, 1 worst. Scale by the SCAA (Specialty Coffee Association of America). Cupping is a process that involves roasting the coffee and simply brewing it by adding hot water to the ground beans.

* Aroma:
  + Aromatic aspects when infused with water
* Flavor:
  + Taste and aroma, mid tones of coffee, based on flavor wheel
* Aftertaste:
  + Duration of positive flavor attributes of coffee
* Acidity:
  + Brightness (higher number) or sourness of coffee
* Body:
  + Heaviness perceived on the tongue
* Balance:
  + Overall rating of coffee
* Uniformity:
  + Consistency of taste
* Cup Cleanliness:
  + Transparency in the cup, should be free of off-flavors and defects
* Sweetness:
  + Subtle pleasant sweetness in coffee
* Moisture:
  + Should have a moisture content of 8 to 12.5%. Less/More will be low.
* Defects:
  + Info: Primary (e.g. black beans, sour beans) or Secondary (e.g. broken beans)
  + Primary: should have zero
  + Secondary: should have less than five
* SCAA Total Cup Points (100 point scale):
  + 90-100 - Outstanding
  + 85-89.99 - Excellent
  + 80-84.99 - Very Good
  + <80 - No scoring

Farm and Bean Data

* Country of Origin
* Farm Name
* Lot Number
* Mill
* Altitude
* Region
* Processing method
* Variety:
  + Type of arabica coffee
* Color:
  + Grayish-blue, blue-green is most desirable, gradually dries in sun
  + Green is middle of the road
  + Green-brown scorched during drying or picked while under or over-ripe
This dataset contains quality measurements that I have yet to explore. These ratings follow the Specialty Coffee Association of America and are categorical tests to separate delicious coffee from not delicious.
I am curious about delicious coffee and how it correlates to other variables such as altitude and defects. I would like to have the formula down for excellent coffee! I love coffee!
I know the variety and processing of the individual beans may also be an important measure of quality beans. I also think that the origins of the coffee might be an interesting investigation.
I created a new variable called Quality.Standards. If the beans have no category one defects and less than 5 category two defects, then they are considered quality beans. When they meet quality standards, it will be marked as “TRUE”.
I noticed that mean altitude had some pretty strange outliers. One had a max of 190164 meters in altitude, which is impossible. I knew altitude wasn’t going to be an ongoing factor in the rest of my analysis, so I chose not to remove it at this time.
I removed some columns from my dataset. Each one either had no relevance (such as Bag.Weight or Certification.Contact), held the same value in every row (Species), or had nearly no data in it.
I had to do some data cleaning, especially with Harvest.Year. That field had all sorts of different date formats, and some values had no years at all. I also changed it to Harvest.Year.Begin because some values were ranges (I’m assuming for harvests that start at the end of one year and carry into the next). I also changed the Expiration date to be a year only rather than month, day and year.
Origins of beans and how they affect quality. In other words, how do harvest year, country, and quality standards (bean defects) bear on the overall quality of the coffee per SCAA measurements?
Bean processing and how it affects quality. In other words, how do processing method, variety of bean, and moisture of the bean bear on the overall quality of the coffee per SCAA measurements?
Now let’s take a look at the coffee origin information (Country.of.Origin, Harvest.Year.Begin, Quality.Standards) and how it affects good coffee, which will be measured by Total Cup Points.
The overall quality of the coffee test is Total Cup Points (Total.Cup.Points) and is as follows:
* SCAA Total Cup Points (100 point scale):
  + 90-100 - Outstanding
  + 85-89.99 - Excellent
  + 80-84.99 - Very Good
  + <80 - Below Grade
To help me in my analysis I’m going to create another column called Total.Cup.Result: Outstanding, Excellent, Very Good, Below Grade. I think it will add quick clarity and understanding to the visualizations.
## Total.Cup.Points Total.Cup.Result
## 2 89.92 Excellent
## 3 89.75 Excellent
## 4 89.00 Excellent
## 5 88.83 Excellent
## 6 88.83 Excellent
## 7 88.75 Excellent
## 8 88.67 Excellent
## 9 88.42 Excellent
## 10 88.25 Excellent
## 11 88.08 Excellent
## 12 87.92 Excellent
## 13 87.92 Excellent
## 14 87.92 Excellent
## 15 87.83 Excellent
## 16 87.58 Excellent
## 17 87.42 Excellent
## 18 87.33 Excellent
## 19 87.25 Excellent
## 20 87.25 Excellent
## 21 87.25 Excellent
## 22 87.17 Excellent
## 23 87.17 Excellent
## 24 87.08 Excellent
## 25 87.08 Excellent
## 26 86.92 Excellent
## 27 86.92 Excellent
## 28 86.83 Excellent
## 29 86.67 Excellent
## 30 86.58 Excellent
## 31 86.58 Excellent
## 32 86.50 Excellent
## 33 86.42 Excellent
## 34 86.33 Excellent
## 35 86.25 Excellent
## 36 86.25 Excellent
## 37 86.25 Excellent
## 38 86.25 Excellent
## 39 86.25 Excellent
## 40 86.17 Excellent
## 41 86.17 Excellent
## 42 86.17 Excellent
## 43 86.17 Excellent
## 44 86.08 Excellent
## 45 86.08 Excellent
## 46 86.08 Excellent
## 47 86.00 Excellent
## 48 86.00 Excellent
## 49 86.00 Excellent
## 50 86.00 Excellent
## 51 86.00 Excellent
## 52 86.00 Excellent
## 53 85.92 Excellent
## 54 85.92 Excellent
## 55 85.92 Excellent
## 56 85.83 Excellent
## 57 85.83 Excellent
## 58 85.83 Excellent
## 59 85.83 Excellent
## 60 85.75 Excellent
## 61 85.75 Excellent
## 62 85.75 Excellent
## 63 85.58 Excellent
## 64 85.58 Excellent
## 65 85.58 Excellent
## 66 85.50 Excellent
## 67 85.50 Excellent
## 68 85.50 Excellent
## 69 85.50 Excellent
## 70 85.50 Excellent
## 71 85.42 Excellent
## 72 85.42 Excellent
## 73 85.42 Excellent
## 74 85.42 Excellent
## 75 85.42 Excellent
## 76 85.33 Excellent
## 77 85.33 Excellent
## 78 85.33 Excellent
## 79 85.33 Excellent
## 80 85.33 Excellent
## 81 85.33 Excellent
## 82 85.33 Excellent
## 83 85.33 Excellent
## 84 85.25 Excellent
## 85 85.25 Excellent
## 86 85.25 Excellent
## 87 85.17 Excellent
## 88 85.17 Excellent
## 89 85.08 Excellent
## 90 85.08 Excellent
## 91 85.08 Excellent
## 92 85.08 Excellent
## 93 85.08 Excellent
## 94 85.08 Excellent
## 95 85.08 Excellent
## 96 85.08 Excellent
## 97 85.00 Excellent
## 98 85.00 Excellent
## 99 85.00 Excellent
## 100 85.00 Excellent
## 101 85.00 Excellent
## 102 85.00 Excellent
## 103 85.00 Excellent
## 104 85.00 Excellent
## 105 85.00 Excellent
## 106 85.00 Excellent
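One way to build a column like Total.Cup.Result is with `cut()`. This is a sketch on a handful of made-up scores chosen to sit on the band edges; `right = FALSE` makes each band include its lower bound, so 85.00 counts as Excellent and 84.99 as Very Good.

```r
# Scores chosen to test the edges of each band (illustrative only).
points <- c(90.58, 89.92, 85.00, 84.99, 80.00, 79.99, 0)

# Bands are [lower, upper): <80, 80-84.99, 85-89.99, 90-100.
result <- cut(
  points,
  breaks = c(-Inf, 80, 85, 90, Inf),
  labels = c("Below Grade", "Very Good", "Excellent", "Outstanding"),
  right = FALSE
)

print(data.frame(Total.Cup.Points = points, Total.Cup.Result = result))
```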
Result: The correlations are rather low. However, category two defects show a somewhat stronger negative correlation with total cup points (-0.21) than category one defects (-0.11).
## Category.One.Defects Category.Two.Defects Total.Cup.Points
## Category.One.Defects 1.0000000 0.3422092 -0.1068260
## Category.Two.Defects 0.3422092 1.0000000 -0.2136031
## Total.Cup.Points -0.1068260 -0.2136031 1.0000000
The Total Cup Points by Country plot below gives a quick visual as to which countries have not only the most coffee samples, but also which have higher total cup points.
There are a whole lot of outliers in the total cup points. While I recognize that outliers can distort statistical analysis, in this case I feel the outliers can be very informative about coffee, since coffee tasting results are so subjective. So, instead of removing the outliers, I chose to limit my x-axis and y-axis to better visualize what’s going on, and I will compare median vs. mean to determine which is the better measure of “middle of the road” going forward.
Looking at the data below, it’s easier to see that Mexico, Colombia and Guatemala contribute the most samples.
## # A tibble: 37 x 5
## Country.of.Origin total_cup_mean total_cup__medi… total_cup_max n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Mexico 80.9 81.6 87.2 236
## 2 Colombia 83.1 83.2 86 183
## 3 Guatemala 81.8 82.5 89.8 181
## 4 Brazil 82.4 82.4 88.8 132
## 5 Taiwan 82.0 82 86.6 75
## 6 United States (Hawaii) 81.8 82.8 87.9 73
## 7 Honduras 79.4 81.7 86.7 53
## 8 Costa Rica 82.8 83.2 87.2 51
## 9 Ethiopia 85.5 85.2 90.6 44
## 10 Tanzania, United Republi… 82.4 82.2 86.5 40
## # … with 27 more rows
The below information shows the countries with the highest median of Total Cup Points.
## # A tibble: 37 x 5
## Country.of.Origin total_cup_mean total_cup__median total_cup_max n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 United States 86.0 87.2 87.9 8
## 2 Papua New Guinea 85.8 85.8 85.8 1
## 3 Ethiopia 85.5 85.2 90.6 44
## 4 Japan 84.7 84.7 84.7 1
## 5 Kenya 84.3 84.6 86.2 25
## 6 Panama 83.7 84.1 85.8 4
## 7 Uganda 84.1 83.9 86.8 26
## 8 Ecuador 83.8 83.8 83.8 1
## 9 Colombia 83.1 83.2 86 183
## 10 Costa Rica 82.8 83.2 87.2 51
## # … with 27 more rows
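The tables above are grouped summaries. As a rough base R sketch of the same idea (the output above came from a tibble, so this is an alternative, and the `toy` scores are made up):

```r
# Toy stand-in for the real data (values are illustrative only).
toy <- data.frame(
  Country.of.Origin = c("Ethiopia", "Ethiopia", "Mexico", "Mexico", "Mexico"),
  Total.Cup.Points = c(90.58, 85.00, 80.00, 82.00, 81.00),
  stringsAsFactors = FALSE
)

# Summarise each country's scores, then stack the per-country rows.
by_country <- do.call(rbind, lapply(
  split(toy, toy$Country.of.Origin),
  function(g) {
    data.frame(
      Country.of.Origin = g$Country.of.Origin[1],
      total_cup_mean = mean(g$Total.Cup.Points),
      total_cup_median = median(g$Total.Cup.Points),
      total_cup_max = max(g$Total.Cup.Points),
      n = nrow(g),
      stringsAsFactors = FALSE
    )
  }
))

# Highest median first.
by_country <- by_country[order(-by_country$total_cup_median), ]
print(by_country)
```

Keeping `n` next to the averages matters here: a country with a single sample can top the median ranking without telling us much.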
We already know that 2012 was the biggest harvest year in this dataset (between 2008 and 2018).
Which years appear to have the best total cup points? From the plot below, 2014 has the best total cup points. 2012 definitely produced the most coffee, but most of it falls below the mean line. So it appears 2014 was not only a good year production-wise, it also yielded some good quality coffee!
Now let’s take a look at the bean processing information: Variety, Moisture, Processing.Method
Let’s look at how the moisture in the beans affects coffee taste. Beans should have a moisture content of .08 (8%) to .12 (12%). Less or more will be considered too dry or too wet for coffee standards.
When I first ran this, there was a low outlier, so I zoomed in to the Total Cup Points from 60 - 80 to get a better look. I also added a trend line to see whether increased moisture correlates with good tasting coffee.
From the moisture visualization, it appears there is almost no correlation between moisture and total cup points. The correlation test below agrees: the correlation is weak and negative (r = -0.13), although with 1311 samples it is still statistically significant (p < 0.001).
##
## Pearson's product-moment correlation
##
## data: coffeedata$Total.Cup.Points and coffeedata$Moisture
## t = -4.5638, df = 1309, p-value = 5.495e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17808326 -0.07149403
## sample estimates:
## cor
## -0.1251497
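To see why a correlation this small can still have a tiny p-value, the numbers in the test output can be recomputed by hand. This sketch uses only the reported `cor` and `df`; the variable names are mine.

```r
# Values reported by the correlation test above.
r <- -0.1251497
df <- 1309

# t statistic for a Pearson correlation: r * sqrt(df / (1 - r^2)).
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# Share of variance in one variable explained by the other.
r_squared <- r^2

print(c(t = t_stat, p = p_value, r_squared = r_squared))
```

The t statistic comes out at about -4.56, matching the output, while r squared is under 2%. So the relationship is real in the statistical sense but explains very little of the variation in total cup points.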
By changing the visualization to a scatter plot and changing the alpha level, I could produce a sort of pseudo-heat map. This gives me a better idea of how moisture affects good coffee.
According to the summary below and the plot above, the mean moisture is about .09 and the median is .11, with the middle half of the data falling between .09 and .12. The best coffee appears to sit between .10 and .12. It seems that moisture does make an impact: it can’t be too much, and it can’t be too little.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.09000 0.11000 0.08886 0.12000 0.28000
Next, it’s time to look at the variety of coffee. I grouped the data together by Variety and then summarized the mean and median.
## # A tibble: 6 x 4
## Variety total_cup_mean total_cup_median n
## <chr> <dbl> <dbl> <int>
## 1 Arusha 82.2 82.4 5
## 2 Blue Mountain 82.1 82.1 2
## 3 Bourbon 81.9 82.3 226
## 4 Catimor 83.3 83.2 20
## 5 Catuai 81.3 81.9 74
## 6 Caturra 82.4 83.1 256
In terms of discovering Variety, I’m pretty pleased. I can now tell just from these visualizations which are the three top-tasting varieties of coffee beans.
The visualization supports my finding that Ethiopia produces some of the best tasting coffee (rated number 3). Kenya is just south of Ethiopia.
Processing method is the next item to look at.
## # A tibble: 6 x 5
## Processing.Method total_cup_mean total_cup_median total_cup_max n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Natural / Dry 82.4 82.8 89 251
## 2 Other 81.3 81.8 84.7 26
## 3 Pulped natural / honey 82.8 82.7 86.6 14
## 4 Semi-washed / Semi-pulped 82.6 82.5 86.1 56
## 5 Washed / Wet 82.0 82.4 90.6 812
## 6 <NA> 82.4 83.1 89.8 152
Inspired by another coffee lover, I also included this data so you can get an idea of what countries use what methods!
##
## Natural / Dry Other Pulped natural / honey
## Brazil 80 1 7
## Burundi 0 0 0
## China 3 0 1
## Colombia 27 0 0
## Costa Rica 0 1 2
## Cote d?Ivoire 0 0 0
## Ecuador 1 0 0
## El Salvador 1 0 0
## Ethiopia 17 0 0
## Guatemala 10 2 0
## Haiti 1 0 0
## Honduras 14 0 0
## India 1 0 0
## Indonesia 2 4 0
## Japan 0 0 1
## Kenya 2 0 0
## Laos 0 0 0
## Malawi 0 0 0
## Mauritius 0 0 0
## Mexico 17 0 0
## Myanmar 2 1 0
## Nicaragua 4 3 0
## Panama 1 1 0
## Papua New Guinea 0 0 0
## Peru 0 0 0
## Philippines 1 0 0
## Rwanda 0 0 0
## Taiwan 13 9 2
## Tanzania, United Republic Of 1 0 0
## Thailand 2 0 1
## Uganda 7 0 0
## United States 1 1 0
## United States (Hawaii) 40 0 0
## United States (Puerto Rico) 0 0 0
## Vietnam 3 3 0
## Zambia 0 0 0
##
## Semi-washed / Semi-pulped Washed / Wet
## Brazil 24 6
## Burundi 0 1
## China 0 12
## Colombia 0 121
## Costa Rica 1 45
## Cote d?Ivoire 0 1
## Ecuador 0 0
## El Salvador 1 15
## Ethiopia 0 8
## Guatemala 0 161
## Haiti 0 4
## Honduras 0 35
## India 0 0
## Indonesia 5 6
## Japan 0 0
## Kenya 0 22
## Laos 0 3
## Malawi 0 11
## Mauritius 0 0
## Mexico 14 198
## Myanmar 0 5
## Nicaragua 0 11
## Panama 0 2
## Papua New Guinea 0 1
## Peru 0 8
## Philippines 0 4
## Rwanda 0 1
## Taiwan 9 37
## Tanzania, United Republic Of 1 37
## Thailand 0 18
## Uganda 1 18
## United States 0 6
## United States (Hawaii) 0 9
## United States (Puerto Rico) 0 4
## Vietnam 0 1
## Zambia 0 1
On average, the Processing Method that produces the highest total cup points is the “Pulped natural/honey” method.
In case you are wondering:
The pulped natural / honey process begins drying directly after de-pulping rather than undergoing fermentation to remove the mucilage. “Pulped natural” tends to have more fruit and fermented flavors because the bean has more time to interact with the natural sugars from the cherry as enzymes break down the mucilage around the bean. However, if producers aren’t careful about stirring and watching, funky flavors will emerge in the roasted coffee.
Washed / Wet coffees, on the other hand, are known for their vibrant notes. Removing all of the cherry prior to drying allows the intrinsic flavors of the bean to shine without anything holding them back. Fruit notes are still found in washed coffees, but fermented notes and berry notes are less common.
The Natural / Dry method involves drying coffee cherries on either patios or raised beds in the sun. This process only works in areas that are hot and dry, and it tends to give the coffee a more fruity flavor.
I guess it’s a matter of taste!
I noticed that the U.S. coffee was rated the best tasting, followed by Papua New Guinea. Ethiopia was third and the best variety of coffee also was from Ethiopia (Kenya was just below it).
I expected to see more of a correlation, or at least an upward trend of moisture vs. total cup points, where more moisture meant higher cup points. Instead, the scatter plot showed that the best coffee falls within a particular range of moisture. It apparently can’t be too wet or too dry.
For both variety and processing method, it was easier to see information when grouping the data and then looking at average, median and max. On average, the processing method that produces the highest total cup points is the “Pulped natural/honey” method. However, the max total cup points came from the “Washed/wet” method.
Observing total cup points by harvest year, there was an interesting relationship with quality standards, which is the bean inspection. TRUE indicates the sample passed quality standards for bean inspection. The Total Cup Points were lower for samples that didn’t pass quality standards.
The strongest relationship I found seemed to be between defects and total cup points. Beans that fall short of quality standards yield less delicious coffee. Makes sense!
Now it’s time to look at the data a little more closely. We’ll be adding some additional variables to our plots as well as looking at individual SCAA tests.
The lines provide a good quick glance at where the averages lie for Total Cup Points for coffee that met quality standards and coffee that didn’t. It’s interesting how the mean for total cup points plummeted for 2011. My guess is that some very low outliers or incorrect data pulled it down. However, that line staying below the “met quality standards” line is what I would expect.
Mexico was down low on the list of “excellent coffee” measured by total cup points despite being one of the highest producers. Is that because they don’t meet quality standards during the bean inspection process? Surprisingly no! Most of their beans appear to meet inspection.
That brings a suspicion to mind that most bad tasting coffee may be more related to the variety or processing of the beans than the picking of the correct beans. This supports the earlier plots I evaluated.
This plot shows that beans meeting quality standards do make for better coffee, but not by much. However, there are a lot of outliers in the TRUE category.
This supports the idea that, to some degree, the fewer defects the better the coffee. But if you look at Category Two, you can see with the U.S. and Kenya that’s not always the case.
Early on, the processing method wasn’t reported (see NA below). Most of the top countries tend to favor either natural/dry or washed/wet, with Japan mostly picking the pulped natural method. Later, Kenya changed to the washed/wet method. The natural/dry method is most effective in drier climates, and its coffees tend to have stronger fruit notes. The washed/wet method brings out the non-fruit notes because the fruit is removed. The change in processing could be because of weather, or because producers wanted variety in their coffee beans.
There are a variety of tests and measures that are used to determine total cup points. The below matrix shows the correlation of all the testing measures.
## Aroma Flavor Aftertaste Acidity Body Balance Uniformity Clean.Cup Sweetness
## 1 8.67 8.83 8.67 8.75 8.50 8.42 10 10 10
## 2 8.75 8.67 8.50 8.58 8.42 8.42 10 10 10
## 3 8.42 8.50 8.42 8.42 8.33 8.42 10 10 10
## 4 8.17 8.58 8.42 8.42 8.50 8.25 10 10 10
## 5 8.25 8.50 8.25 8.50 8.42 8.33 10 10 10
## 6 8.58 8.42 8.42 8.50 8.25 8.33 10 10 10
These are all the items that go into the testing to determine the total cup points. Uniformity is a measure of consistency between cups, and clean cup is how it looks, so it makes sense that those items aren’t strongly correlated with the rest. The others all have to do with taste, so I can see why they are all strongly correlated. I am surprised by sweetness, though.
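A correlation matrix of the testing measures can be computed with `cor()`. This sketch uses only the six rows printed above for a few of the measures, so the numbers it produces are not the full-dataset correlations. Columns with no variation (Uniformity is 10 in all six rows) have to be dropped first, because a correlation with a constant column is undefined.

```r
# The first six rows shown above, for a subset of the measures.
cup_head <- data.frame(
  Aroma      = c(8.67, 8.75, 8.42, 8.17, 8.25, 8.58),
  Flavor     = c(8.83, 8.67, 8.50, 8.58, 8.50, 8.42),
  Aftertaste = c(8.67, 8.50, 8.42, 8.42, 8.25, 8.42),
  Uniformity = c(10, 10, 10, 10, 10, 10)
)

# Keep only columns that actually vary.
varying <- sapply(cup_head, function(col) sd(col) > 0)

# Pairwise Pearson correlations, rounded for readability.
cor_matrix <- round(cor(cup_head[, varying]), 2)
print(cor_matrix)
```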
I originally was thinking that defects didn’t matter so much for overall total cup points. However, the more I investigated, the more it strengthened the idea that defects do negatively affect total cup points.
I found the strong correlation among the tests that go into total cup points interesting. The “viewing” tests were not correlated with the other tests, which is what I would expect.
The Average Total Cup Points by Coffee Variety gives a quick look at which varieties of beans to look for when finding good coffee. Varieties higher on the scale are the higher quality coffees. I found the results didn’t differ much using the median.
The Total Cup Points by Country shows not only which country has higher total cup points, but also that defects in the coffee do affect the taste, and which countries tend to have fewer defects. When you buy coffee you won’t necessarily know which beans had defects, but it is useful to know which countries tend to have fewer defects, since this will affect the taste of the coffee. For example, note that Ethiopian coffee had few defects.
The Processing Method of Top Countries shows all the different processing methods. The processing methods are all a matter of taste. For example, if you prefer a fruity coffee then you’d lean toward a coffee that uses the Natural/Dry method, where the fruit dries on the bean. It also shows the Harvest Year in which those processes occurred.
This dataset had 1311 observations.
Initially I noticed there wasn’t much of a correlation between defects and good tasting coffee, or between moisture and good tasting coffee. But on more careful observation, I realized that when I compared the total cup points with country, and then added the third variable I created, quality standards, you could see that quality standards do affect the taste of coffee.
Also, I noticed that with moisture, there was almost no correlation from a numbers aspect. However, if you look at it on the plot, good coffee occurs within a certain range of moisture.
This makes me realize that processing makes more of an impact on the flavor of coffee than I initially thought. However, how is one to really know what processing is used? That’s where origins and patterns come into play. The top five “total cup point on average” countries tend to be: 1. United States, 2. Papua New Guinea, 3. Ethiopia, 4. Japan and 5. Kenya. It’s easy to say that the United States has the best coffee, but I’m not sure I agree.
United States coffee has a lot of defects, but Ethiopia had the max total cup points. Also, the top variety of coffee was an Ethiopian variety. It leads me to believe that Ethiopian coffee is the true hero. So what about Papua New Guinea? Wasn’t that second? Yes. But I also noticed a lot of their data reporting was blank, so it’s kind of an unknown. You can’t ignore the fact that taste tests show it is on average the second best tasting coffee. However, without consistent data you can’t make predictions for the future.
Also, it is a matter of taste and opinion. For example, coffees processed with the Natural/Dry method will have more of a fruity taste.
Future work on this dataset could fill in a lot of the empty data. Many fields, such as Farm, Mill, and color of bean, were missing values, which may have made a big impact on how to find good coffee.